library(tidyverse)
library(RMySQL)
library(lubridate)
Front matter June 3, 2020 at 4:59PM.
Name your files applied_ps_3.Rmd and applied_ps_3.html. (5 pts.)
Follow the style guide (10 pts.)
This submission is our work alone and complies with the 30535 integrity policy.
Add your initials to indicate your agreement: JCI
Add names of anyone you discussed this problem set with: **__**
Submit by pushing your code to your repo on Github Classroom: https://classroom.github.com/g/5vDiXToZ.
Late coins used this pset: X. Late coins left: X.
waze data
You can find the waze data dictionary here.
At the start of the course that you agreed to follow these data usage terms. Here are the most important parts:
Prelim questions
Working with data on a server adds a challenge as you have to make calls to the database which take time to process. A call to the database can be slow for several reasons.
We can adjust for 1 and 2 by testing our code on small subsections of the data.
Next week we will provide an opt out where you can use csv we provide. Using this option will result in a 10 percent discount on your problem set final grade. For example, if you earn \(90\) pts based on your solutions, your final grade will be \(90 \cdot .9 = 81\).
filter() to reduce the amount of data you pull while exploring data. For example, you can filter by time and location to only get data for a small part of the city and/or over a short time period.collect() a small sample data set so that the you have data in memory on your computer.collect() the entire data set each time you want to work with it.I would like to see Waze predict when draw bridges will raise and lower to allow ships to pass under. This is a major problem for people who have to commute over those bridges because it’s so unpredictable. It can easily add 20+ minutes to a commute. As is, Waze doesn’t know that bridges will raise until traffic has stopped and Wazers report it. But if Waze could cooperate with the authorities that control bridges (including cities, states, and the army corps of engineers) they could warn drivers in advance of these delays, allowing users to seek alternate routes. The challenge for adding this new variable would be coordinating with these authorities to generate and share the data; there are likely to be many agencies involved in maintaining these bridges, and some of them may not have the infrastructure that would be needed to report this data. But another approach might be to use live marine traffic data like https://www.marinetraffic.com/ to “guess” when a bridge might need to be raised to allow a ship under.
Read up on the ggmap package, which will be useful for doing these problems. Particularly, get to know the get_stamenmap() function. If you find yourself downloading 1000s of tiles, check your settings. You are welcome to try using google basemaps as well; while free for new users, this will require a credit card. The version of ggmap on CRAN is out of date, instead find and install it from github.
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(RMySQL)
library(stringr)
connection <- DBI::dbConnect(RMySQL::MySQL(),
user = "ppha30531",
dbname="Waze2",
port = 3306,
password = "bUYjwnKXf49M2pb",
host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)
# Show which tables are available from the Waze database
DBI::dbListTables(connection)
## [1] "chiAlerts"
# Create table referencing the data in the chiAlerts table
# of the waze database
chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
#Y bounds: 41.935, 41.896 (W George St to W Chicago Ave
event_data <- chi_alerts_sql %>%
filter(str_detect(street, "N Western Ave"),
41.896 <= location_y & location_y <= 41.934,
type == "ACCIDENT") %>%
collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data, file = "event_data.rds")
event_data <- readRDS(file = "event_data.rds")
# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,0>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data%>%
group_by(street)%>%
summarize(n())
## # A tibble: 1 x 2
## street `n()`
## <chr> <int>
## 1 N Western Ave 315
corridor_7 <- c(left = -87.7, bottom = 41.899 , right =
-87.67, top = 41.934)
corridor_7_stamenmap <- get_stamenmap(data = event_data,
bbox = corridor_7,
zoom = 15,
maptype = "toner-lite")
## Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
corridor_7_map <- ggmap(corridor_7_stamenmap,
base_layer = ggplot(data = event_data)
)
corridor_7_map
corridor_7_map +
geom_point(aes(location_x, location_y), color = "red")
## Warning: Removed 12 rows containing missing values (geom_point).
connection <- DBI::dbConnect(RMySQL::MySQL(),
user = "ppha30531",
dbname="Waze2",
port = 3306,
password = "bUYjwnKXf49M2pb",
host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)
chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
event_data <- chi_alerts_sql %>%
filter(street %in% c("E 79th St", "W 79th St", "Chicago Ave","E Chicago Ave", "W Chicago Ave")) %>%
collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data, file = "event_data.rds")
event_data <- readRDS(file = "event_data.rds")
# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,1>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data%>%
group_by(street)%>%
summarize(n())
## # A tibble: 5 x 2
## street `n()`
## <chr> <int>
## 1 Chicago Ave 3222
## 2 E 79th St 37827
## 3 E Chicago Ave 2293
## 4 W 79th St 13842
## 5 W Chicago Ave 29932
event_data <- event_data%>%
mutate("corridor" = ifelse(street %in% c("E 79th St", "W 79th St"), "79th St", "Chicago Ave"),
event_date_time = as.POSIXct (pubMillis/1000, origin="1970-01-01")
)
ggplot(event_data,
aes(hour(event_date_time),
fill = corridor)) +
geom_bar(position = "dodge")
connection <- DBI::dbConnect(RMySQL::MySQL(),
user = "ppha30531",
dbname="Waze2",
port = 3306,
password = "bUYjwnKXf49M2pb",
host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)
chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
event_data_comparison <- chi_alerts_sql %>%
filter(street %in% c("E 87th St", "W 87th St", "E Division St", "W Division St", "Division St")) %>%
collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data_comparison, file = "event_data.rds")
event_data_comparison <- readRDS(file = "event_data.rds")
# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,2>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data_comparison%>%
group_by(street)%>%
summarize(n())
## # A tibble: 5 x 2
## street `n()`
## <chr> <int>
## 1 Division St 726
## 2 E 87th St 1043
## 3 E Division St 78
## 4 W 87th St 10818
## 5 W Division St 21531
event_data_comparison <- event_data_comparison%>%
mutate("corridor" = ifelse(street %in% c("E 87th St", "W 87th St"), "87th St", "Division St"),
event_date_time = as.POSIXct (pubMillis/1000, origin="1970-01-01")
)
ggplot(event_data_comparison,
aes(hour(event_date_time),
fill = corridor)) +
geom_bar(position = "dodge")
a. Looking beyond traffic, what other alerts are very common in this area? Do you think these alerts would slow down the 66 / 79? If so, what steps could the City take to address the issues?
ggplot(event_data,
aes(hour(event_date_time),
fill = type)) +
geom_bar(position = "dodge")
Besides traffic jams, weather hazards and accidents are also common. These would definitely slow down traffic for the buses. There also seems to be a case where 79th street was closed, triggering many event reports.
connection <- DBI::dbConnect(RMySQL::MySQL(),
user = "ppha30531",
dbname="Waze2",
port = 3306,
password = "bUYjwnKXf49M2pb",
host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)
chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
single_event_data <- chi_alerts_sql %>%
filter(uuid == "c5a73cc6-5242-3172-be5a-cf8990d70cb2") %>%
collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data_comparison, file = "single_event_data.rds")
single_event_data <- readRDS(file = "single_event_data.rds")
# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,3>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data_comparison
## # A tibble: 34,196 x 19
## country nTHumbsUp city reportRating confidence reliability type uuid
## <chr> <int> <chr> <int> <int> <int> <chr> <chr>
## 1 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 2 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 3 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 4 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 5 US 0 Chic… 3 0 5 WEAT… bdfc…
## 6 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 7 US 0 Chic… 3 0 5 WEAT… bdfc…
## 8 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 9 US 0 Chic… 3 0 5 WEAT… bdfc…
## 10 US 0 Chic… 1 0 5 WEAT… 870c…
## # … with 34,186 more rows, and 11 more variables: roadType <int>,
## # magvar <int>, subtype <chr>, street <chr>, location_x <dbl>,
## # location_y <dbl>, pubMillis <dbl>, reportDescription <chr>,
## # scrape_dt <chr>, corridor <chr>, event_date_time <dttm>
connection <- DBI::dbConnect(RMySQL::MySQL(),
user = "ppha30531",
dbname="Waze2",
port = 3306,
password = "bUYjwnKXf49M2pb",
host = "uchicagowazereplica2.cfykgneqoh8w.us-west-2.rds.amazonaws.com"
)
chi_alerts_sql <- tbl(connection, "chiAlerts")
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
single_event_data <- chi_alerts_sql %>%
filter(-87.624138 <= location_x, location_x <= -87.612916,
41.857933 <= location_y, location_y <= 41.867) %>%
collect()
## Warning in .local(conn, statement, ...): Decimal MySQL column 12 imported
## as numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 13 imported
## as numeric
# Save collected data as an rds file, access and read it
saveRDS(event_data_comparison, file = "single_event_data.rds")
single_event_data <- readRDS(file = "single_event_data.rds")
# View active connections
dbListConnections(MySQL())
## [[1]]
## <MySQLConnection:0,4>
# End active connections
lapply( dbListConnections(MySQL()), function(x) dbDisconnect(x) )
## [[1]]
## [1] TRUE
event_data_comparison
## # A tibble: 34,196 x 19
## country nTHumbsUp city reportRating confidence reliability type uuid
## <chr> <int> <chr> <int> <int> <int> <chr> <chr>
## 1 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 2 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 3 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 4 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 5 US 0 Chic… 3 0 5 WEAT… bdfc…
## 6 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 7 US 0 Chic… 3 0 5 WEAT… bdfc…
## 8 US 0 Chic… 4 0 6 WEAT… 6bfc…
## 9 US 0 Chic… 3 0 5 WEAT… bdfc…
## 10 US 0 Chic… 1 0 5 WEAT… 870c…
## # … with 34,186 more rows, and 11 more variables: roadType <int>,
## # magvar <int>, subtype <chr>, street <chr>, location_x <dbl>,
## # location_y <dbl>, pubMillis <dbl>, reportDescription <chr>,
## # scrape_dt <chr>, corridor <chr>, event_date_time <dttm>
single_event_bounds <- c(left = -87.624138, bottom = 41.857933 , right =
-87.612916, top = 41.867)
single_event_stamenmap <- get_stamenmap(data = event_data,
bbox = single_event_bounds,
zoom = 16,
maptype = "toner-lite")
## Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
single_event_map <- ggmap(single_event_stamenmap,
base_layer = ggplot(data = single_event_data)
)
single_event_map
single_event_map +
geom_point(aes(location_x, location_y), color = "red")
## Warning: Removed 34196 rows containing missing values (geom_point).
single_event_data <- single_event_data%>%
mutate(event_date_time = as.POSIXct (pubMillis/1000, origin="1970-01-01"))
Plot the number of jams 6AM-6PM CST. Why are there two humps?
Place one vertical line at each hump.
Next, propose a quantitative measure of traffic jam severity that combines the number of traffic JAM alerts with information in the subtype variable.
Plot this measure from 6AM-6PM CST. Is there any information that is conveyed by your severity measure that was not captured by plotting the number of jams? If so, what is it?
Pick one major accident. What is the uuid? Sample alerts from the two hours before the accident first appeared in the data and two hours after the accident for a geographic box of 0.1 miles around the accident. Make a plot where the y-axis is the number of traffic jam alerts and the x-axis is the five-minute interval from two hours before the accident to two hours after the accident. Warning: This question is harder than it first appears. You might want to review R4DS chapter 12.5 (lecture note 5) on missing values and chapter 16.4 (lecture note 9).
Building on your work for the prior question, write a function that takes as its arguments uuid, a date-time, a latitude and a longitude and returns a data frame with the number of alerts in each five-minute interval from two hours before to two hours after.
Make a data frame with every major accident on Nov 20, 2017. Feed each row of this data frame to your function. Collapse the output into the mean number of traffic jam alerts in each five-minute interval in the two hours before the accident and two hours after the accident for a geographic box of 0.1 miles. Tip: This may take upwards of 20 minutes to run on all major accidents. Use your function on a small sample of accidents first to make sure your code is working as expected before trying to run on all accidents.
Plot the mean number of jam alerts around major accident. To be clear, the correct answer here is a single plot that summarizes jams across major accidents, not one plot for each accident. Congratulations! This is your first event study.